Okapi Chinese Text Retrieval Experiments at TREC-6

Authors

  • Xiangji Huang
  • Stephen E. Robertson
Abstract

The focus of the Okapi TREC-6 Chinese experiments is on investigating the effectiveness of different automatic indexing methods and phrase weighting for retrieval based on probabilistic models over Chinese text. We compare different probabilistic weighting methods based on a range of word and single-character approaches. Two indexing methods are used in our experiments. One uses linguistic units (words, compound words and phrases) in texts as indexing terms to represent the texts; we refer to this method as the word approach. For this approach, text segmentation, which divides text into linguistic units, is regarded not only as a necessary precursor but also as a bottleneck of this kind of system [1]. The other method is based on single Chinese characters: texts are indexed by the characters appearing in them [2]. With single-character approaches, a search can be conducted for any multi-character word or phrase identified at search time, whether or not this word or phrase is in the dictionary. Three automatic runs, city97c1, city97c2 and city97c3, were submitted to TREC-6. All three runs were based on the whole topic. City97c1 and city97c3 use the word indexing approach with different parameter values, and city97c2 uses the character indexing approach. The runs reported here are all on the TREC-6 collection of 26 new Chinese topics and 164,768 documents. The Chinese dictionary we use for our word approach retrieval system contains about 70,000 Chinese words and phrases. Most of these words and phrases come from a dictionary manually constructed in China. We expanded this dictionary while working on the Chinese TREC experiments.
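The contrast between the two indexing methods can be illustrated with a small sketch. It is not the authors' system: the toy dictionary, the two sample documents, the greedy forward maximum matching segmenter, and all function names (`segment_words`, `build_index`, `phrase_search_chars`) are assumptions chosen only to show why a character index can answer a query for a phrase that is missing from the dictionary.

```python
from collections import defaultdict


def segment_words(text, dictionary, max_len=4):
    """Greedy forward maximum matching (an assumed, simplified segmenter)."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary entry matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def build_index(docs, tokenize):
    """Map each indexing term to {doc_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def phrase_search_chars(char_index, phrase):
    """Return docs in which the characters of `phrase` occur consecutively."""
    hits = set()
    if not phrase:
        return hits
    for doc_id, starts in char_index.get(phrase[0], {}).items():
        for start in starts:
            if all(
                start + k in char_index.get(ch, {}).get(doc_id, [])
                for k, ch in enumerate(phrase)
            ):
                hits.add(doc_id)
                break
    return hits


# Toy data (hypothetical): a five-entry dictionary and two short documents.
dictionary = {"信息", "检索", "系统", "中文", "处理"}
docs = {"d1": "信息检索系统", "d2": "中文信息处理"}

# Word approach: index terms are dictionary words found by the segmenter.
word_index = build_index(docs, lambda text: segment_words(text, dictionary))
# Character approach: every single character is an index term.
char_index = build_index(docs, list)

# "信息检索" is not a dictionary entry, so the word index has no such term,
# while the character index can still match it at search time.
print("信息检索" in word_index)                          # False
print(sorted(phrase_search_chars(char_index, "信息检索")))  # ['d1']
```

In this toy setting the word approach depends entirely on what the segmenter and dictionary produce, whereas the character approach defers phrase identification to search time at the cost of position checks for every query character.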


Similar resources

UCLA-Okapi at TREC-2: Query Expansion Experiments

This is the first participation of the Graduate School of Library and Information Science, University of California at Los Angeles, in the TREC Conference. For TREC-2, Category B, UCLA used a version of the Okapi text retrieval system that was made available to UCLA by City University, London, UK. OKAPI has been described in TREC1 (Robertson, Walker, Hancock-Beaulieu, Gull & Lau, 1993a) as well as...


TREC-10 Web Track Experiments at MSRA

In TREC-10, Microsoft Research Asia (MSRA) participated in the Web track (ad hoc retrieval task and homepage finding task). The latest version of the Okapi system (Windows 2000 version) was used. We focused on the development of content-based retrieval and link-based retrieval, and investigated a suitable combination of the two. For content-based retrieval, we examined the problems of weighting...


York University at TREC 2006: Legal Track

York University participated in the legal track this year. For this track, we developed an Okapi-based Legal Search Engine (LSE) v1.0. Our experiments mainly focused on evaluating the effect of a probabilistic text retrieval model on the legal domain. In order to address the special problems in legal text retrieval, new automatic feedback methods and term weighting methods are proposed and tested.


Chinese Document Retrieval at TREC-6

The TREC-6 conference was the fourth year in which document retrieval in a language other than English was carried out. In TREC-3, 4 groups participated in an ad hoc retrieval task on a collection of 208 Mbytes of Mexican newspaper text in the Spanish language. In TREC-4 there were 10 groups who participated, once again in an ad hoc document retrieval task on the same Mexican newspaper texts bu...


TREC-9 Cross-Language Information Retrieval (English-Chinese) Overview

Fredric Gey and Aitao Chen, UC DATA and SIMS, University of California, Berkeley (e-mail: gey@ucdata.berkeley.edu, aitao@sims.berkeley.edu). Abstract: Sixteen groups participated in the TREC-9 cross-language information retrieval track, which focused on retrieving Chinese-language documents in response to 25 English queries. A variety of CLIR approaches were tested and a rich ...




Publication date: 1997